A Systematic Comparison of English Noun Compound Representations
Building meaningful representations of noun compounds is not trivial, since
many of them rarely occur in corpora. To that end, composition functions
approximate the distributional representation of a noun compound by combining
its constituent distributional vectors. In the more general case, phrase
embeddings have been trained by minimizing the distance between the vectors
representing paraphrases. We compare various types of noun compound
representations, including distributional, compositional, and paraphrase-based
representations, through a series of tasks and analyses, and with an extensive
number of underlying word embeddings. We find that indeed, in most cases,
composition functions produce higher quality representations than
distributional ones, and they improve with computational power. No single
function performs best in all scenarios, suggesting that a joint training
objective may produce improved representations.
Comment: MWE workshop @ ACL 201
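The composition idea above can be sketched in a few lines: approximate a compound's vector by combining its constituents, then compare against an observed distributional vector. This is a minimal sketch with an additive composition function and toy 3-d vectors standing in for pretrained embeddings; the vectors and the compound are illustrative, not from the paper.

```python
import numpy as np

def compose_additive(vecs, weights=None):
    """Approximate a compound's embedding as a (weighted) average of its
    constituent word vectors -- one of the simplest composition functions."""
    vecs = np.asarray(vecs, dtype=float)
    if weights is None:
        weights = np.ones(len(vecs)) / len(vecs)
    return np.average(vecs, axis=0, weights=weights)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors standing in for pretrained word embeddings (hypothetical).
olive = np.array([0.9, 0.1, 0.0])
oil = np.array([0.2, 0.8, 0.1])
olive_oil_observed = np.array([0.6, 0.5, 0.05])  # distributional vector

composed = compose_additive([olive, oil])
print(cosine(composed, olive_oil_observed))  # close to 1 for this toy example
```

More expressive composition functions replace the fixed averaging with learned weights or a small network over the concatenated constituent vectors; the comparison in the abstract spans such variants.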
Breaking NLI Systems with Sentences that Require Simple Lexical Inferences
We create a new NLI test set that shows the deficiency of state-of-the-art
models in inferences that require lexical and world knowledge. The new examples
are simpler than the SNLI test set, containing sentences that differ by at most
one word from sentences in the training set. Yet, the performance on the new
test set is substantially worse across systems trained on SNLI, demonstrating
that these systems are limited in their generalization ability, failing to
capture many simple inferences.
Comment: 6 pages, short paper at ACL 201
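The construction described above, a hypothesis that differs from a training-set sentence by a single word, can be sketched as a tiny helper. The substitution pair and label below are illustrative stand-ins, not examples from the released test set.

```python
def make_lexical_probe(premise, target, substitute, label):
    """Create an NLI example whose hypothesis differs from the premise by a
    single word, probing lexical/world knowledge."""
    words = premise.split()
    assert target in words, f"{target!r} not found in premise"
    hypothesis = " ".join(substitute if w == target else w for w in words)
    return {"premise": premise, "hypothesis": hypothesis, "label": label}

# Swapping one instrument for another yields a contradiction that hinges on
# knowing the two words denote mutually exclusive objects.
ex = make_lexical_probe(
    "A man is playing a saxophone", "saxophone", "trumpet", "contradiction"
)
print(ex["hypothesis"])  # "A man is playing a trumpet"
```

Because the surface overlap with the training sentence is near-total, a model that succeeds here must rely on the lexical relation between the swapped words rather than on pattern matching.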
Automatic Evaluation of Generative Models with Instruction Tuning
Automatic evaluation of natural language generation has long been an elusive
goal in NLP. A recent paradigm fine-tunes pre-trained language models to emulate
human judgements for a particular task and evaluation criterion. Inspired by
the generalization ability of instruction-tuned models, we propose a learned
metric based on instruction tuning. To test our approach, we collected HEAP, a
dataset of human judgements across various NLG tasks and evaluation criteria.
Our findings demonstrate that instruction tuning language models on HEAP yields
good performance on many evaluation tasks, though some criteria are less
trivial to learn than others. Further, jointly training on multiple tasks can
yield additional performance improvements, which can be beneficial for future
tasks with little to no human annotated data.
Comment: 11 pages, 1 figure
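An instruction-tuned metric of the kind described above is driven by prompts that spell out the task, the evaluation criterion, the source, and the system output. The sketch below shows one plausible prompt layout; the field names and wording are assumptions, not the actual HEAP format.

```python
def build_eval_prompt(task, criterion, source, output, scale=(1, 5)):
    """Render one evaluation instruction for a learned NLG metric,
    ending with a rating request on a fixed scale."""
    lo, hi = scale
    return (
        f"Task: {task}\n"
        f"Criterion: {criterion}\n"
        f"Source: {source}\n"
        f"System output: {output}\n"
        f"Rate the output's {criterion} on a scale of {lo} to {hi}."
    )

prompt = build_eval_prompt(
    "summarization", "fluency",
    "The committee met on Tuesday to discuss the budget proposal.",
    "Committee met Tuesday discuss budget.",
)
print(prompt)
```

Keeping the criterion an explicit slot in the prompt is what lets a single fine-tuned model cover many task/criterion combinations, and is why joint multi-task training can transfer to new criteria with little annotated data.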
GD-COMET: A Geo-Diverse Commonsense Inference Model
With the increasing integration of AI into everyday life, it is becoming
crucial to design culturally aware AI systems that can serve users from
diverse backgrounds. In this paper, we present GD-COMET, a geo-diverse
version of the COMET commonsense inference model. GD-COMET goes beyond Western
commonsense knowledge and is capable of generating inferences pertaining to a
broad range of cultures. We demonstrate the effectiveness of GD-COMET through a
comprehensive human evaluation across 5 diverse cultures, as well as extrinsic
evaluation on a geo-diverse task. The evaluation shows that GD-COMET captures
and generates culturally nuanced commonsense knowledge, demonstrating its
potential to benefit NLP applications across the board and contribute to making
NLP more inclusive.
Comment: Accepted to EMNLP 2023 Main Conference
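A COMET-style model is queried by pairing a head event with a relation and letting the model generate the tail. The sketch below shows that query pattern with ATOMIC-style relation names; the `[GEN]` template and the stub generator are assumptions for illustration, not GD-COMET's actual interface.

```python
RELATIONS = ["xIntent", "xEffect", "xReact"]  # ATOMIC-style relations

def format_query(head, relation):
    """Pair a head event with a relation; the model generates the tail."""
    return f"{head} {relation} [GEN]"

def infer_all(model, head, relations=RELATIONS):
    """Collect one generated tail per relation for a single head event."""
    return {rel: model(format_query(head, rel)) for rel in relations}

# A stub standing in for the fine-tuned generator, so the sketch runs offline.
stub_model = lambda query: f"<generated tail for: {query}>"
inferences = infer_all(stub_model, "PersonX attends a tea ceremony")
print(inferences["xIntent"])
```

A geo-diverse model differs not in this interface but in what it generates: for the same head event, the produced tails should reflect the norms of the relevant culture rather than defaulting to Western commonsense.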